CAGEF_services_slide.png

Advanced Graphics and Data Visualization in R

Lecture 05: Dimension Transformation

0.1.0 An overview of Advanced Graphics and Data Visualization in R

"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.

This lesson is the fifth in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.


0.2.0 Lecture objectives

This week will focus on exploring the relationships within your data: comparing between experimental samples and applying a suite of clustering, dimension reduction, and projection algorithms.

At the end of this lecture you will have covered visualizations related to

  1. Clustering with heatmaps and dendrograms
  2. K-means clustering
  3. Principal component analysis
  4. Non-linear projections

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue boxes: A key concept that is being introduced
Yellow boxes: A risk or caution
Green boxes: Recommended reads and resources to learn more R
Red boxes: A comprehension question which may or may not involve a coding cell. You will usually find these at the end of a section.

0.4.0 Data used in this lesson

Today's datasets will focus on the smaller datasets we worked on in earlier lectures (namely our Ontario public health unit COVID-19 demographics data) and a new set of RNAseq analyses of different tissue samples from COVID-19 patients.

0.4.1 Dataset 1: phu_demographics_census_norm.csv

This is a combination of datasets from previous lectures. This incorporates PHU demographic data with StatsCan census data from 2017 to produce a normalized estimate of cases, deaths, and hospitalizations across age groups and public health units in Ontario.

0.4.2 Dataset 2: Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv

This is the same readcount data we looked at in Lecture 04: RNA-Seq read count data generated from SARS-CoV-2 infections of AEC cells, used to compare the timecourse of (pro-inflammatory) expression changes in samples treated with and without HSP90 inhibitors. Published in iScience doi: https://doi.org/10.1016/j.isci.2021.102151

0.4.3 Dataset 3: GSE150316_DeseqNormCounts_final.txt

From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241, this dataset has normalized expression counts from RNAseq data. It covers multiple samples and tissues from COVID-positive patients with a focus on lung tissue. The expression data has also been left unfiltered for SARS-CoV-2, so viral expression counts are included.

0.4.4 Dataset 4: 2020.07.30.20165241-supp_tables.xlsx

From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241 this dataset contains patient information like viral load from tissues that were used for RNAseq.


0.5.0 Packages used in this lesson

repr - a package useful for altering some of the attributes of objects related to the R kernel.

tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2

viridis helps to create color-blind palettes for our data visualizations

RColorBrewer has some helpful palettes that we'll need to colour our data.

gplots will be used to help generate heatmap/dendrogram visualizations.

FactoMineR and factoextra will be used for PCA generation and visualization

Rtsne and umap are packages implementing non-linear projection algorithms


1.0.0 Data categorization, dimension reduction, and dimension transformation

Last week we looked at an analysis of RNAseq data through a number of methods starting with broad-level volcano plots and moving towards gene-level expression visualizations with dotplots. In between we stopped to take a look at heatmaps. In this instance we simply used heatmaps to convey expression levels of multiple genes across one or more samples.

Looking more broadly, we now wish to ask questions such as "how similar are our samples?", "can our samples be grouped or categorized in some way?" and "is there an underlying structure or architecture to our data?"

We can study these aspects using a number of techniques that range from clustering/categorization to projection of high-dimensional data onto a lower set of dimensions (usually 2 or 3).


1.1.0 What kind of data do we care about?

We cannot begin our journey until we talk about the nature of the data we are interested in examining. Usually, our data will consist of many samples (observations) with some number of features captured about each observation. For example, with RNAseq data we could consider the measurements we capture for each gene as a separate feature/variable/column.

Conversely, you may have hundreds or thousands of samples you'd like to categorize in some way (or show that you can categorize) using just a smaller set of features. For every nut, there is a tool of sorts for cracking it!

We'll start with a modified version of the PHU age group dataset that we've seen before from Lecture 02 and 03. The demographics data has been modified to include a set of normalized values (individuals per 100k) that was calculated based on StatsCan 2017 census data. Modeling off of these data, the 2022 sizes for each age group have been estimated and case, death, and hospitalization counts have been normalized based on these subgroup sizes.

The updated dataset will be found in phu_demographics_census_norm.csv and we'll use it to guide us through the two sections of today's lecture.

Let's begin by loading the data and taking a look at its structure.
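As a sketch, loading might look like the following; the object name mirrors the covid_demographics_norm.df used later in this lecture, and the filename comes from section 0.4.1.

```r
library(tidyverse)

# Load the normalized PHU demographics data
covid_demographics_norm.df <- read_csv("phu_demographics_census_norm.csv")

# Check the structure: column names, types, and the first few values
glimpse(covid_demographics_norm.df)
head(covid_demographics_norm.df)
```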


1.2.0 Looking for trends/groups in your data

Let's begin looking at our covid_demographics_norm.df. Using a series of data sets, we've created a consolidated dataset with:

  1. Cumulative cases, deaths, and hospitalizations due to COVID-19 within each age group per PHU.
  2. Representation of each age group within each data category as a percent of total incidents in each PHU.
  3. Using 2017 census data, the number of cases per 100,000 individuals normalized by estimated population size for each age group within each PHU.

The question we want to answer is: of the 34 public health units, which look most similar based on the normalized data for each age group? In order to visualize this data, we'll want to convert our current long-form data that looks something like this:

Our starting table:

public_health_unit age_group total_cases ... cases_per_100k deaths_per_100k hospitalizations_per_100k
Algoma 0 to 4 181 ... 3302 0 164
Algoma 12 to 19 412 ... 4464 0 54
... ... ... ... ... ... ...

Our end result:

public_health_unit population_2022 0 to 4 5 to 11 12 to 19 20 to 39 40 to 59 60 to 79 80+
Algoma 117840 3302 4464 7570 4301 4890 2331 4116
... ... ... ... ... ... ... ... ...

We need to do the following to the dataset

  1. Select just the variables we are interested in.
  2. Pivot out the age group data into its own columns so we have 1 observation (row) per PHU.
  3. Correct the position of the age groups (5 to 11 needs to be moved).
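The three steps above might be sketched as follows. The column names are assumed from the tables shown above (and population_2022 is assumed to be recorded per PHU), so they may differ slightly in the actual file.

```r
covid_wide.df <- covid_demographics_norm.df %>%
  # 1. select just the variables we are interested in
  select(public_health_unit, population_2022, age_group, cases_per_100k) %>%
  # 2. pivot the age groups out into their own columns (one row per PHU)
  pivot_wider(names_from = age_group, values_from = cases_per_100k) %>%
  # 3. move "5 to 11" into its correct ordinal position
  relocate(`5 to 11`, .after = `0 to 4`)
```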

1.2.1 Cast our data to a matrix for heatmaps

At this point, we would normally be prepared to visualize our data with ggplot but our data is in wide-format and we'll be using a package that prefers to work with matrix data. In that case, we need to strip the data down further because matrix data must be all of the same type and we want to work with numeric data! We'll use as.matrix() for our conversion.

  1. We'll move the public_health_unit information over to the row names of our matrix so we can still track our data properly.
  2. We'll still keep our population_2022 column of data but we'll have to remember that it's there.
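A minimal sketch of the conversion, assuming the wide-format dataframe from the previous step:

```r
covid_demographics_norm.mx <- covid_wide.df %>%
  column_to_rownames("public_health_unit") %>%  # PHU names become row names
  as.matrix()                                   # remaining columns are numeric

head(covid_demographics_norm.mx)
```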

1.3.0 Plot and reorder our PHUs as a heatmap with heatmap.2()

From the gplots package we can use the function heatmap.2() to generate a heatmap that can do a little more than what we were doing with our ggplot2 version. The nice aspect of using heatmap.2() is that we can ask it to reorder our samples along both the rows and columns. At the same time we can present the relationship between columns elements or row elements in the form of dendrograms.

1.3.1 How do we reorder our data via clustering?

In its current form, we have ordered our data along the columns in an ascending ordinal arrangement. While this makes sense to us now, it may not help to visually identify the strongest trends or groupings in our data. Clustering attempts to bring order to our data by grouping data according to specific algorithms.

Overall single points are usually grouped together as neighbours with the "nearest" neighbours determined by a metric of some kind. These clusters are then further grouped using the same metrics until the entire set is presented in these ordered groups. These groupings/relationships can also be presented as dendrograms. In our current case, we aren't concerned with determining groups per se but rather with just connecting them by their similarity and then creating a hierarchical order.

Our data is usually coded in n-dimensional space, depending on the nature of our dataset. For covid_demographics_norm.mx our 7 columns are coded by data from 34 PHUs meaning each column is coded in 34 dimensions or 34 features. Conversely, our 34 rows represent PHUs each coded by data across 7 age groups and therefore 7 dimensions.

Important or helpful parameters to run heatmap.2()

Let's plot our data with and without a dendrogram to compare.
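A hedged sketch of both calls; parameter choices like the palette, scaling, and margins here are illustrative, not prescriptive.

```r
library(gplots)

# Without reordering: rows and columns keep their input order
heatmap.2(covid_demographics_norm.mx,
          dendrogram = "none", Rowv = FALSE, Colv = FALSE,
          trace = "none",                # suppress the default trace lines
          col = viridis::viridis(25),
          margins = c(6, 12))            # leave room for long PHU names

# With clustering: reorder both rows and columns and draw dendrograms
heatmap.2(covid_demographics_norm.mx,
          dendrogram = "both",
          trace = "none",
          col = viridis::viridis(25),
          scale = "column",              # z-score within each age group
          margins = c(6, 12))
```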


1.4.0 Building correlation heatmaps with complex RNA-Seq data

Our heatmap above is drawn from a relatively simple dataset but it helps us to see that there are some strong signals that differentiate between our different PHUs. Now this is a 34x7 grid where we can investigate all of the data in our dataset. What happens when we want to produce this kind of data from something more complex?

Recall from last lecture we investigated read count data from Wyler et al., 2020 - Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv. We used this dataset to produce a scatterplot matrix to compare between some of the replicate data within the set. Across this set there are 36 sets of RNA-Seq data spanning 27,011 genes.

Let's begin by opening up the data file and getting a quick reminder of what it looks like.


1.4.1 Wrangle our readcount data

As you might recall from last week, there is quite a bit of data in this set. So that we can more easily visualize our data, we'll want to wrangle it a bit more by:

  1. updating the column names to remove some of the unnecessary title information.
  2. filtering by readcounts to limit our genes to those with values between 1000 and 3000.
  3. Convert our dataframe to a matrix and rename the rows using the genes.
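One way to sketch this wrangling; the gene column name and the sample-title prefix pattern are assumptions about the file, not its actual contents.

```r
readcounts.df <- read_tsv("Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv")

readcounts.mx <- readcounts.df %>%
  # 1. trim unnecessary title information (the prefix here is hypothetical)
  rename_with(~ str_remove(.x, "^Sample_"), -gene) %>%
  # 2. keep genes whose counts sit between 1000 and 3000 in every sample
  filter(if_all(where(is.numeric), ~ .x >= 1000 & .x <= 3000)) %>%
  # 3. convert to a matrix with genes as row names
  column_to_rownames("gene") %>%
  as.matrix()
```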

1.4.2 Plot a heatmap of your readcount data

What if we were to produce a heatmap directly with the raw data? It would be impossible to read, and including a dendrogram would actually take forever to process given the 27,011 rows of data to process across the 36 datasets. That's why we'll use our small subset of data as an example to build a heatmap.

Let's plot the data now with heatmap.2 and see how it looks with a dendrogram.


1.4.3 Generate a correlation matrix of your data

From the above output, we can see that our heatmap is already quite hard to read and that's coming off of using just 109 genes! We do, however, get some strong trends regarding our datasets - we can see certain sets of replicates are grouped together in the data but not all of them. This might be clearer if we were to factor in all of the datapoints or just the relevant genes that define the differences between datasets. We'll discuss the second part of that thought later in this lecture.

Recall that what we're really interested in, regarding these readcount data, is to ask just how similar the experiments are between each other. Whereas we were a little limited by the space needed to produce the scatterplot matrix, we were able to produce correlation values between experiments on a small scale. We can extend that visualization forward to compare between all datasets and then visually summarize the data as a heatmap.

The portion of the scatterplot matrix we'll extend is the correlation values. Whereas the ggpairs() function defaults to a Pearson correlation, here we can choose between Pearson and Spearman correlations. Sometimes you may wish to compare both, since Pearson examines linear relationships whereas Spearman uses a ranking method to compare variables for monotonic relationships.

To generate our matrix we'll employ a couple of additional functions:
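The key one is base R's cor(), which computes pairwise correlations between columns (i.e., between experiments). A sketch, assuming the readcounts.mx matrix built above:

```r
# Pearson: linear relationships between experiments
readcount_cor.mx <- cor(readcounts.mx, method = "pearson")

# Spearman: rank-based, monotonic relationships
readcount_cor_spearman.mx <- cor(readcounts.mx, method = "spearman")

dim(readcount_cor.mx)  # a square, samples-by-samples matrix
```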


1.4.4 Generate a heatmap of your correlation matrix

Now that we've completed the correlation matrix and it appears to be correct, we can generate the heatmap of the data, allowing it to group data based on the Pearson correlation values.


1.5.0 Make a cluster dendrogram with hclust()

As you can see, clustering our data makes a big difference in how our data is displayed. In our last few heatmaps, we allowed heatmap.2() to generate its own clustering. We could have supplied it with a specific dendrogram to order the data as well. Under the hood, the function is actually calling on the hclust() function to produce the dendrograms.

The choice of method determines how the data is clustered. The choices include: ward.D, ward.D2, single, complete (default), average, mcquitty, median and centroid.

In general ward.D, ward.D2 and complete strategies try to find compact small "spherical" clusters. The single linkage adopts a 'friends of friends' clustering strategy. The other methods can be said to aim for somewhere in between. For more information on these, you can dig into the hclust() documentation

We can turn to hclust() to generate just dendrograms for us but prior to that we still need to reformat our matrix a little using the dist() function.


1.5.1 Calculate the distance between our points with dist()

Remember we talked about our data being in n-dimensional space? Using those coordinates, there are a number of ways to calculate the distance between points. The simplest form is to calculate the euclidean distance between points using the generic formula:

$$d(p,q) = \sqrt{(p_{1}-q_{1})^2 + (p_{2}-q_{2})^2 + \ldots + (p_{n}-q_{n})^2}$$

but there are a number of other options as well. For instance, you can also choose the maximum distance between two components (same dimension). The dist() function can generate these values for you, using the parameter method to determine how the distance will be calculated. Your options are: euclidean, maximum, manhattan, canberra, binary or minkowski.

The dist() function will return a dist object which is a lower triangle distance matrix between all of the rows/observations using the columns as coordinates.

Let's return to our PHU data to try it out and see how it works.
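A minimal sketch; dist() treats each row (PHU) as a point in 7-dimensional age-group space.

```r
phu.dist <- dist(covid_demographics_norm.mx, method = "euclidean")

class(phu.dist)                # a "dist" object (lower-triangle distances)
as.matrix(phu.dist)[1:3, 1:3]  # peek at a few pairwise distances
```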


1.5.2 Cluster your distance matrix

Now we are ready to cluster our distance matrix using the hclust() method

Parameters that are important
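As a sketch, using the dist object from the previous step and the default complete linkage:

```r
phu.hclust <- hclust(phu.dist, method = "complete")

str(phu.hclust)  # height, merge order, labels, and the method used
```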


1.5.3 Plot your cluster object using fviz_dend() from the package factoextra

Now that we have generated our clustering object, you can see that it holds a number of features including the height (length) of our branches, an ordering of our PHUs and the order in which they merge. The merge process is described in more detail as well with the hclust() documentation

To simplify the plotting process, we can use the fviz_dend() function which will parse through the hclust object to produce a proper dendrogram. We can even specify some grouping or colouring schemes based on how many groups, k, we believe are present in our data.

We can treat the resulting plot like a ggplot to update particular elements as well.
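A sketch of the plotting call; k = 3 reflects the grouping suggested by our earlier heatmaps and is an assumption, not a rule.

```r
library(factoextra)

fviz_dend(phu.hclust,
          k = 3,          # colour the branches into 3 putative groups
          rect = TRUE,    # draw rectangles around the clusters
          cex = 0.6) +    # shrink the leaf labels
  labs(title = "Hierarchical clustering of Ontario PHUs")  # ggplot tweaks work
```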


2.0.0 Clustering data into groups

So we've visualized our dataset as a dendrogram and it gives us an idea of how the samples might be related (in n-dimension space) based on the values we produced from normalization.

K-means clustering is unsupervised learning for uncategorized data. We don't really know the training labels of our data and are more interested in seeing which ones group together. How many clusters should we aim to generate? We've already seen that our data likely splits into 3 groups based on the heatmaps and hclust() data. Will we get the same thing with a k-means method?

When in doubt, you can quickly consult the factoextra function fviz_nbclust(), which can estimate how many clusters are ideal for representing your data. Note that it may not produce your ideal number of clusters, but if you're stuck, this is a good start.

There are three method options: wss (total within-cluster sum of squares, the "elbow" method), silhouette (average silhouette width), and gap_stat (gap statistic), each of which suggests an optimal number of clusters by a different criterion.
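A sketch of all three calls; each returns a ggplot whose shape (elbow, peak, or gap) suggests a cluster count.

```r
library(factoextra)

fviz_nbclust(covid_demographics_norm.mx, kmeans, method = "wss")         # elbow
fviz_nbclust(covid_demographics_norm.mx, kmeans, method = "silhouette")  # peak
fviz_nbclust(covid_demographics_norm.mx, kmeans, method = "gap_stat")    # gap
```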


2.1.0 Generate your clusters using kmeans()

So from three different analyses we didn't really get a consensus on the best number of clusters, but it looks like the answer ranges between 1 and 4. Since our data already suggests 3 clusters, let's start with that. We'll use the kmeans() function to accomplish this.

Similar in idea to hclust(), the kmeans() algorithm attempts to partition the supplied data into k groups, with an emphasis on minimizing the sum of squares from points to their assigned cluster centers. This differs from hclust(), which builds the entire relationship as a dendrogram without actually choosing "clusters".

We're going to also use set.seed() for our random number generation. We tend to think of things as random, but computationally, randomness is built on algorithms. Therefore we can "recreate" our randomness if we use a pre-determined starting point also known as the "seed".

The parameters we're interested in using with kmeans() are:
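A hedged sketch of the call; the seed value is arbitrary, and nstart simply asks for several random initializations.

```r
set.seed(1234)  # any fixed seed makes the random starts reproducible

phu.kmeans <- kmeans(covid_demographics_norm.mx,
                     centers = 3,   # the k suggested by our earlier clustering
                     nstart = 25)   # try 25 random starts, keep the best fit

phu.kmeans$size     # number of PHUs per cluster
phu.kmeans$cluster  # cluster assignment for each PHU
```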


2.2.0 Plot your k-means results with fviz_cluster()

Notice the structure of this output? It includes cluster, which denotes the cluster to which each row belongs among the 3 clusters chosen. The centre of each cluster is defined by centers, which in this case is a 3x7 array where each row uses 7 values to define its position within our 7-dimension dataset. We can see each cluster's size is 18, 11, and 5 PHUs respectively. Notice, however, that none of the original coordinates for our data are retained in this object.

Rather than use ggplot directly, we'll use the fviz_cluster() function to help us transform the values of our points from a 7-dimension coordinate system to a 2-dimension visual format suitable for our simpler brains. Under the hood fviz_cluster() is performing a principal component analysis to project our data along the first two principal components (more on what that means later).

The parameters we should be concerned with in this function include:

Let's see how our data has been partitioned by the k-means clustering.
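As a sketch; note that fviz_cluster() needs the original data supplied again, since the kmeans object doesn't retain the coordinates.

```r
fviz_cluster(phu.kmeans,
             data = covid_demographics_norm.mx,  # supply original coordinates
             ellipse.type = "convex",            # hull drawn around each cluster
             repel = TRUE)                       # avoid overlapping PHU labels
```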


2.3.0 Clustering your RNA-Seq data

Let's return to the Wyler et al., 2020 readcount data and take a look at it through a different lens. Whereas before we had generated a heatmap of the data based on looking at genes expressed within a readcount range, we'll now take a different approach.

Let's look at the top 500 variable genes in the dataset to help cluster them. To accomplish that we'll want to generate the standard deviation in read count for each row (gene). We'll utilize a few verbs we haven't used before in this class:
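One way to sketch this, using rowwise(), c_across(), and slice_max(); the gene column name is assumed.

```r
top_variable.df <- readcounts.df %>%
  rowwise() %>%                                           # operate per gene
  mutate(sd_count = sd(c_across(where(is.numeric)))) %>%  # sd across samples
  ungroup() %>%
  slice_max(sd_count, n = 500) %>%                        # keep the top 500
  select(-sd_count)
```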


Now we'll just convert our dataframe into a matrix of values so we can perform our various analyses.


2.3.1 Create a heatmap of the most variable genes

Since we're here, let's take a quick step back and look at the heatmap from our new dataset. Does it reveal anything to us now that it is more nuanced?


2.3.2 Generate a k-means cluster analysis

Looks like the heatmap still doesn't reveal a lot of information to us like how samples might be grouped. Let's proceed with k-means and see if that works better. We just need to repeat our steps on a different set of data. Since we now have our most diverse genes and their readcounts, we'll take a look at if we can cluster this information.

  1. Generate an estimate on the appropriate number of clusters
  2. Create our k-means cluster object
  3. Visualize the k-means object and see which experiments tend to group together.

2.3.3 Interpreting our RNA-Seq clustering

Looking at the result, what's nice about this kind of output is that we can immediately see the separation or clustering of our data. It looks like 4 was the way to go as there are 4 tight groupings from our data. It might be possible to further subdivide the group in the top left of our figure but it's unlikely.

Based on our selection of genes, we observe:

  1. 24-, 48-, and 72-hour samples mock-treated (DMSO) and mock-infected, cluster together with 24-hour mock-treated samples infected with SARS-CoV-2.
  2. 48- and 72-hour samples mock-treated (DMSO) and infected by SARS-CoV-2 appear to share a similar profile.
  3. 24-hour samples treated with 200 nM 17AAG cluster together whether or not they are infected with SARS-CoV-2.
  4. 48- and 72-hour samples treated with 200 nM 17AAG cluster together whether or not they are infected with SARS-CoV-2.

These groupings suggest that perhaps the 24-hour SARS-CoV-2-infected timepoint is similar to uninfected controls. Meanwhile the 48- and 72-hour SARS-CoV-2-infected samples are similar to each other but likely showing very distinct transcriptional changes due to infection. Treatment with the HSP90 inhibitor, however, appears to sufficiently alter the expression profiles of infected and mock-infected cells to make them cluster together. This would be a good starting point to begin digging further into the data.

Skeptic or believer? While our above analysis seems to make sense to us, it lacks a deeper understanding of what might be happening. For one thing, we haven't even looked closely at the genes used to create our clustering. How many of them are relevant to the infection process? Is the clustering due to off-target effects of the HSP90 inhibitor (17AAG)? While clustering is an effective way to quickly identify if subgroups exist within your data, it's important to follow up these findings with in-depth analyses of the data.
squid_PCA.png
Yes the data clusters, but why does it cluster?

3.0.0 Dimension reduction trims the (data) fat

One problem we encountered in our above analysis of RNA-Seq data is that there were simply too many dimensions! With ~27k gene entries in our dataframe, using clustering to analyse the entire dataset would be memory-intensive if not impossible altogether. To circumvent this problem we attempted to subset our data in two ways - filtering by read counts, and comparing variance across datasets. These approaches were a form of dimensionality reduction - namely feature elimination. Our choices, however, may have inadvertently been throwing away data that could prove insightful! This is where other dimensionality reduction methods can help guide our analyses.

You can achieve dimensionality reduction in three general ways:

  1. Feature elimination: you can reduce the feature space by removing unhelpful features. No information is gained but you trim down your dataset size.
  2. Feature selection: on the reverse side, you can simply choose the features most important to you. How will you decide? Some schemes include ranking your features by importance, but this may suffer from information loss if incorrect features are chosen.
  3. Feature extraction: create new independent features which are combinations of the old features! Attempt to extract the essential features of your data in a more informationally dense manner.
| Technique | Description | Type | Useful for |
| --- | --- | --- | --- |
| Random forest | A popular machine-learning classification algorithm that randomly chooses from subsets of features to classify data; the most frequently appearing features amongst the forest's trees are kept. | Feature selection | Only takes numeric inputs |
| PCA | Principal component analysis attempts to maximize variation when transforming to a lower-dimensional space. All new features are independent of one another. | Linear feature extraction | Actually really good at finding outliers in RNAseq data |
| t-SNE | t-Distributed Stochastic Neighbour Embedding is a non-linear technique suitable for high-dimension datasets. It minimizes the difference between neighbour-probability distributions in the high- and low-dimension spaces, mirroring nearest-neighbour relationships. | Non-linear feature extraction | Determining whether your data has underlying structure |
| UMAP | Uniform Manifold Approximation and Projection is projection-based like t-SNE; it attempts to preserve local data structure but has improved preservation of global data structure. | Non-linear feature extraction | Faster than t-SNE |

Since we're not exactly building a classifier but rather trying to find trends in our data, we won't be looking at random forests here. Instead, we are interested in exploratory data analysis: we have a dataset we want to understand, and sometimes it is too complex to directly divine 1) the underlying structure and 2) the features that drive that structure.


3.1.0 Reduce your dimensionality with PCA

Going back to our PHU data, we used 7 dimensions to classify our data but what if we wanted to transform that information in some way to reduce the number of dimensions needed to represent their relationships? For our data set, 7 dimensions isn't a lot of features but in other cases (like RNA-Seq) you might encounter feature-dense datasets and without knowing a priori which ones are meaningful, PCA provides a path forward in classifying our data.

Now there are caveats to PCA. It performs linear feature extraction by building new features in your n-dimensional space such that each new feature (think of it as a fitted line) captures the maximum possible variance of the original data points. Each new component must be uncorrelated with (i.e., perpendicular to) the previous ones while accounting for the next highest amount of variance. More resources on how this works are in the references.

Principal Component Analysis second principal.gif
How do we go about maximizing the variance (spread of red dots) along our new data feature? The point where variance is maximized also minimizes the error (red line lengths). Generated by user Amoeba on stackexchange.com

All math aside, our goal is to reduce our feature set to something smaller by trying to represent our data with these new features. Just remember that highly variable data and outliers can dominate your principal components.

For simplicity let's head back and look at our PHU age group data again. To illustrate our example with PCA, let's use the original data normalized by PHU population.


3.2.0 Use PCA() to generate our analysis

To generate our analysis of the PHU data, we'll use the FactoMineR function PCA() for which there are some parameters we'll be using that we should discuss:
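A hedged sketch of the call; if the population_2022 column is still in the matrix, we drop it first so only the 7 age-group variables enter the analysis (the column name is assumed).

```r
library(FactoMineR)

# Keep only the age-group columns (drop population_2022 if present)
age_cols <- colnames(covid_demographics_norm.mx) != "population_2022"

phu_scaled.pca <- PCA(covid_demographics_norm.mx[, age_cols],
                      scale.unit = TRUE,  # centre and scale to unit variance
                      ncp = 7,            # retain all 7 components
                      graph = FALSE)      # suppress the default plots
```

Re-running with scale.unit = FALSE gives the unscaled analysis for comparison.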

3.2.1 What does it mean to scale our data?

Remember that PCA is trying to maximize the variance captured along each principal component across all of the observations supplied. Depending on the nature of your variables, you may have, for instance, two different unit types like height and mass. Smaller changes in height may be matched with much larger changes in mass, or just wider overall variance. This may lead the PCA algorithm to prioritize mass over height when you'd prefer they have equal importance. By centering on the mean and scaling data to unit variance, everything is compared as a z-score, bringing the overall variance across a variable to within ~3 standard deviations.

Let's compare PCA with and without scaling shall we?


3.2.2 Our PCA object has a complex number of data pieces

Regardless of scaling, the result of our PCA() call produces an object with many variables we can access. Above you can see a brief description for each variable but we are most interested in a few particular ones:


3.3.0 Use the eigenvalues to determine the percent variance of each component

The eigenvalues from our analysis pair with the eigenvectors (principal components) to help transform our data from the original feature set to the new set of features. While the eigenvectors determine the directions of the new feature space, the eigenvalues represent the magnitudes of the vectors and, in this case, can be used to calculate the percent of overall variance explained by each eigenvector.

The important take-away is that we can now see just how much of our variance is explained by each new principal component. We can access this information directly from phu_scaled.pca or by using the function get_eigenvalue(). We can also plot this as a barchart using fviz_eig().
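Both accessors shown as a sketch:

```r
library(factoextra)

phu_scaled.pca$eig              # eigenvalue, % variance, cumulative % per PC
get_eigenvalue(phu_scaled.pca)  # factoextra's equivalent, as a data frame
```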


3.3.1 Build a scree plot to look at the effect of your dimensions

See how nearly 80% of our variation is explained in just our first dimension? Let's use fviz_eig() to display this information visually.
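A minimal sketch of the scree plot:

```r
fviz_eig(phu_scaled.pca, addlabels = TRUE)  # scree plot with % variance labels
```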


3.3.2 Most of our variance is accounted for in the first two principal components!

Looking at our eigenvalues we can see that even though 7 new PCs were generated, the first two explain almost 92% of our variance. That suggests that whatever linear separation in our data exists, we can recreate in a two-dimensional projection of coordinates from PC1 and PC2. This suggests that we can recreate the underlying structure of our data with these two new features instead of using a 7-dimensional space!

This makes some sense when we take a closer look at the data. For instance, there isn't much visual information found in our 0 to 4 age range. The small distribution of information in these features may not do much, in the grand scheme, to change how the PHUs are separated from each other. Then again, perhaps they do contain information that we simply cannot see easily! How will we know?


3.4.0 Investigate your variables with get_pca_var()

Within our PCA, we can access information regarding how our original variables are transformed into the new space. This is all stored in the var element of our PCA object. We can extract the aspects of this using the get_pca_var() and visualize these to determine the quality of our variables and their representation by the new principal components.
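A sketch of the accessor and the elements we'll use:

```r
phu.var <- get_pca_var(phu_scaled.pca)

phu.var$coord    # variable coordinates along each PC
phu.var$cos2     # quality of representation (squared cosine)
phu.var$contrib  # percent contribution of each variable to each PC
```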


3.4.1 What is the contribution of each variable to the new components?

Each original variable may, to some degree, influence the amount of information captured into each new principal component. When we look at each we can see the breakdown of variables, which in some cases, can reveal to us where more or less variation is found as well.

From above we can see that our original variables equally contribute to our first principal component but there is much more contribution in PC2 from three variables: 0 to 4, 5 to 11 and 80+. From our previous scree plot that means that 12.8% of overall variation, which is described in PC2, is mainly from these three groups.

How many dimensions should I get and how many do I keep? It is at this point that we should discuss the maximum number of possible dimensions. Given n observations and p features, the maximum number of dimensions after reduction is min(n-1, p). In terms of how many dimensions you keep, a general rule of thumb is that at least 80% of your variation should be accounted for by the principal components that you choose. Keep that in mind!

3.4.2 The quality and representation of your variables in the PCA

From the remaining variable information, coord and cor can both be used to plot our original variables along different pairs of principal components. These values are identical because this is not a projection of our variables onto the PCs but a correlation with them! The distance between the origin and a variable's coordinates gauges the quality of that variable on the map. This value is summarized in the cos2 (squared cosine) element.


3.4.3 Plot your variable information on a unit circle with fviz_pca_var()

To capture all of our information in a single visualization, we can use fviz_pca_var() to assess our variables. High-quality variables will sit close to the circumference of the circle; low-quality correlations will sit closer to the origin. Consistent with what we saw above, most of our variables are positively correlated with PC1, and across PC1/PC2 our variables correlate highly.
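A minimal correlation circle might look like this (the colour gradient is an illustrative choice, not a required setting):

```r
library(factoextra)

# variables coloured by cos2 so high-quality variables stand out
fviz_pca_var(phu_scaled.pca,
             col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)   # nudge labels apart to avoid overlap
```

Variables whose arrows reach the edge of the unit circle are well represented by the first two PCs.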


3.5.0 Check if PHUs cluster or separate with fviz_pca_ind()

Now that we've generated our PCA, let's see if there is any clear separation between our PHUs across the first two dimensions of new features. Much like our variables, all of the individual coordinate data can be found in the ind element of our PCA object. There we'll find the coord information, so we could build a ggplot directly from phu_scaled.pca$ind$coord, BUT there's a simpler way.

We'll parse the PCA object with fviz_pca_ind() to plot our data along the first two principal components. We can define pointsize and col.ind to add some population size information to our PHUs.
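A hedged sketch of such a call, assuming a hypothetical numeric vector phu_population holding one value per PHU:

```r
library(factoextra)

# individuals plotted on PC1/PC2; size and colour scale with population
fviz_pca_ind(phu_scaled.pca,
             pointsize = phu_population,  # hypothetical population vector
             col.ind   = phu_population,
             repel     = TRUE)
```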


3.5.1 Scaled vs unscaled data

As you can see from graphing both PCA sets, something a little more drastic happens when we choose not to scale our data. PC1's portion of variance increases to account for 83.2% of overall variance. Many of our points move, and the separation between some of our PHUs increases (Kingston/Frontenac and Peel). These changes are likely due to the larger range of values in the 20 to 39 age group, which now contributes 31% of PC1's variation.

So think carefully about your features and what they represent. Depending on their range and units, you may be inadvertently weighting them more heavily than your other features! Scaling adjusts your data so that all features are weighted equally!
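To compare the two behaviours yourself, both versions can be generated with prcomp() (phu.mat is a stand-in name for the numeric PHU matrix):

```r
# scale. = TRUE centres each feature and divides by its standard deviation
phu_scaled.pca   <- prcomp(phu.mat, scale. = TRUE)
phu_unscaled.pca <- prcomp(phu.mat, scale. = FALSE)

# compare the proportion of variance captured by each set of components
summary(phu_scaled.pca)
summary(phu_unscaled.pca)
```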

3.6.0 Can we cluster our transformed data?

Now that we've effectively reduced the complexity of our data, can we reproduce the same k-means clustering as before? Let's try clustering on just the two dimensions we've already graphed!
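One possible sketch, clustering on the first two PC coordinates (the choice of 3 centres is an assumption for illustration):

```r
library(factoextra)

# pull the PC1/PC2 coordinates out of the PCA object
pc_coords <- phu_scaled.pca$ind$coord[, 1:2]

set.seed(42)                                 # k-means starts are random
pc_km <- kmeans(pc_coords, centers = 3, nstart = 25)

# visualize the cluster assignments in the reduced space
fviz_cluster(pc_km, data = pc_coords)
```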


3.6.1 Was it worth it to use PCA?

From our estimations, the clustering choices look nearly identical. This isn't the best dataset to demonstrate with, since it is rather small, but you can see that we have now transformed our data into just 2 features! Could we do the same using just 2 features from our original dataset? No! So imagine transforming a much larger, feature-heavy dataset into a small number of features. We can also extract from the data which of our original features really contribute to each new principal component!

Something to consider the next time you're working with a large data set!


3.7.0 PCA on a large RNA-Seq dataset

Now that we've walked through how to generate a PCA for a simpler dataset, let's return to our RNA-Seq readcount data. We won't trim down the data at all but rather just provide the entire dataset as a matrix for feature reduction. Let's review the steps:

  1. Generate a matrix of your data, naming the columns and rows by whatever information you might have. Make sure your columns represent your features, and rows are observations.
  2. Create the PCA object.
  3. Review your eigenvalues and eigenvectors to assess how well your reduction worked. Determine how many dimensions you wish to use in further analysis.
  4. Calculate the number of possible clusters.
  5. Plot the results.
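The steps above might be sketched as follows (object names like readcounts.df are assumptions standing in for the RNA-Seq data; assumes no zero-variance genes):

```r
library(factoextra)

# 1. build a matrix: rows = samples (observations), columns = genes (features)
readcounts.mat <- as.matrix(readcounts.df)

# 2. create the PCA object, scaling each gene
readcounts.pca <- prcomp(readcounts.mat, scale. = TRUE)

# 3. review the eigenvalues to decide how many dimensions to keep
fviz_eig(readcounts.pca)

# 4. estimate a sensible number of clusters on the reduced coordinates
fviz_nbclust(readcounts.pca$x[, 1:2], kmeans, method = "wss")

# 5. plot the results
fviz_pca_ind(readcounts.pca)
```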

4.0.0 Non-linear projection

t-SNE_UMAP_examples.png
How do we identify trends or groups within deeply complex data in an unsupervised manner?

So far we've spent most of our time learning how to transform our data based on the large amounts of variation it contains. We've discussed some limitations, and in some cases, depending on your dataset size, you may be more interested in finding small, local similarities than the large-scale variation that PCA captures.

There are two popular projections we'll discuss today, but first let's put together a more complex dataset based on some RNA-Seq data in GSE150316_DeseqNormCounts_final.txt and its companion file 2020.07.30.20165241-supp_tables.xlsx.


4.0.1 Reformat our patient meta data

First let's examine our patient_data.df (24 patient samples listed in total) and reformat the column names a bit before selecting just the viral load and viral load percentage information located in the 2nd and 3rd columns. We'll hold onto this in patient_viral_load.df for later use.


4.0.2 Prepare our RNAseq tissue data for analysis by eliminating features

We now want to format tissue_data.df much as we did with our PHU data. Our current data lists genes as rows (observations) and tissue samples as columns. Essentially, we'd like to transpose this, and we could, but transposition converts everything to a matrix. In the end, we want to work with a data frame so we can hold more information.

First, however, to reduce memory use and runtime, we should trim our dataset. We aren't really interested in looking at 59,090 genes, many of which may be barely expressed. Since these are normalized counts across the dataset, we can filter out low-expression genes to make tissue_data_filtered.df. Yes, this would again be considered a form of feature elimination. In general we will:

  1. Convert the row names to their own column.
  2. Calculate the mean of the normalized counts across each observation and filter for a minimum of 0.5.
  3. Calculate the variance across each observation as a backup plan.
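A possible sketch of these steps (object and column names are assumptions; the data frame is assumed to hold only numeric count columns):

```r
library(dplyr)
library(tibble)

# per-gene summaries computed up front so the pipeline stays simple
gene_means <- rowMeans(tissue_data.df)
gene_vars  <- apply(tissue_data.df, 1, var)

tissue_data_filtered.df <- tissue_data.df %>%
  rownames_to_column(var = "gene") %>%   # 1. genes as their own column
  mutate(mean_count = gene_means,        # 2. mean normalized count per gene
         var_count  = gene_vars) %>%     # 3. variance, as a backup plan
  filter(mean_count >= 0.5)              # keep genes with mean >= 0.5
```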

4.0.3 More data wrangling to transpose our RNAseq data

Now that we've filtered our data down to ~29k genes, we'll run through three more steps:

  1. We'll transpose our data using a combination of pivot_longer() and pivot_wider().
  2. We'll use a series of string matches and joins to add some sample information to our data. It won't be used for the analysis itself but will be used when we plot the results!
  3. Lastly, we'll convert the relevant parts of our data frame to a matrix. The algorithms save a huge amount of memory when working with a matrix, and you have limited memory on your Jupyter Hubs!
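Steps 1 and 3 might be sketched like this (the metadata joins in step 2 are file-specific and omitted here; column names are assumptions):

```r
library(tidyr)
library(dplyr)

# 1. transpose: genes become columns, samples become rows
tissue_t.df <- tissue_data_filtered.df %>%
  select(-mean_count, -var_count) %>%    # drop the filtering helper columns
  pivot_longer(-gene, names_to = "sample", values_to = "count") %>%
  pivot_wider(names_from = gene, values_from = count)

# 3. convert the numeric part to a matrix for the projection algorithms
tissue.mat <- as.matrix(select(tissue_t.df, -sample))
rownames(tissue.mat) <- tissue_t.df$sample
```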

4.1.0 t-Distributed Stochastic Neighbour Embedding with the Rtsne package

We now have a somewhat more complex dataset. We are still short on actual samples (now observations), but 90 observations and nearly 30K features isn't so bad. A broad question we may wish to ask of such a dataset is whether there is an underlying structure to these samples - for example, do we see grouping based on tissue type, or perhaps even sample preparation?

t-Distributed Stochastic Neighbour Embedding, or t-SNE, is a way to project high-dimensional data onto a lower dimension with the aim of preserving local similarities rather than global disparity. When looking at data points, t-SNE attempts to preserve the local neighbouring structure. As the algorithm pores over the data, it can use different transformations in different regions as it maps everything down to a lower dimension. It is considered "incredibly flexible" at finding local structure where other algorithms may not.

This flexibility is accomplished through 3 steps:

  1. Reduce the dimensionality of your data features with PCA!
  2. Generate a probability distribution between all pairs of points by making similar objects highly probable and assigning dissimilar objects a low probability.
  3. Define a similar probability distribution for the samples in a lower dimension while minimizing the divergence between the two distributions, based on a distance metric between points in the lower dimension.

We'll discuss how this algorithm affects interpretation after seeing the results, but it is considered an exploratory data visualization tool rather than an explanatory one.

To produce our t-SNE projection we'll use the Rtsne() function from the package of the same name. Some important parameters are:
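A minimal call might look like this (the perplexity value and seed are assumptions; tissue.mat is the matrix built above):

```r
library(Rtsne)

set.seed(1234)                         # t-SNE is stochastic; fix the seed
tissue_tsne <- Rtsne(tissue.mat,
                     dims = 2,         # number of output dimensions
                     perplexity = 10,  # roughly, expected neighbourhood size
                     pca = TRUE,       # initial PCA reduction step
                     check_duplicates = FALSE)
```

Note that perplexity must satisfy 3 * perplexity < n - 1, which limits how large it can be on a small sample set.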


4.1.1 Extract information from our tsne object

Looking above at the result of our t-SNE analysis, we can notice a few things...

  1. We get back the number of objects we put in: 88.
  2. The element Y is an 88x2 matrix holding the coordinates for our new projection.
  3. There are no eigenvectors or other information about the variables or dimensions or how they were used in the projection!

That's right, there is no underlying information mapping back to our original dataset. It's a complete black box with no way to reverse engineer the process. That's because the process itself is stochastic! Whereas PCA was a deterministic process - repeatable the same way every time with the same data - that is not the case for t-SNE. So even though t-SNE can be quite powerful at identifying local similarities, it does not provide a mathematical pathway back to our original data!

Let's extract the information, combine it with our sample information and project it using ggplot2.
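A sketch of the plotting step (sample_info.df and its tissue column are assumptions about the metadata assembled earlier):

```r
library(ggplot2)

tsne.df <- data.frame(tsne1  = tissue_tsne$Y[, 1],
                      tsne2  = tissue_tsne$Y[, 2],
                      tissue = sample_info.df$tissue)  # hypothetical metadata

ggplot(tsne.df, aes(x = tsne1, y = tsne2, colour = tissue)) +
  geom_point(size = 3) +
  theme_minimal()
```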


4.1.2 Interpreting our t-SNE plot

While we don't have a lot of samples, you can still see that we were able to cluster some of our data by cell types without providing that classification to the algorithm! Great job team!

We can see that we get close clustering of tissues like placenta, heart, and bowel. Our liver samples are kind of everywhere but perhaps using a different perplexity would provide different results.

One interesting thing we can see is that regardless of tissue type, we see some samples are clustering together based on case number - namely case numbers 1, 4 and 5 seem to have some strong underlying expression profiles that connect them across tissue samples. We may also be seeing false relationships so beware!


4.2.0 Uniform Manifold Approximation and Projection

This algorithm for projection is in the same flavour as t-SNE projection but has some differences including:

  1. Increased speed and better preservation of the global structure in your data
  2. A different theoretical foundation used to balance between local and global structure

What does that mean for us? Faster results, and more interpretable results! Otherwise the same caveats apply. The setup is slightly easier, with few options to change if you keep the defaults. We can access umap() from the package of the same name. You may also alter the default settings by creating a umap.defaults object. More information on that here

For more tinkering, you can choose to use the uwot package instead where the umap() function has more options that are easily modified.
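A hedged sketch using the umap package (the n_neighbors tweak is only an illustration of editing umap.defaults):

```r
library(umap)

set.seed(1234)                   # the embedding includes stochastic steps
tissue_umap <- umap(tissue.mat)  # the defaults are often a fine start

# to change settings, copy and edit the defaults object
custom_config <- umap.defaults
custom_config$n_neighbors <- 10
tissue_umap <- umap(tissue.mat, config = custom_config)
```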

4.2.1 Extract information from our UMAP object

Looking at our UMAP object tissue_umap we see it houses the projection parameters used but also some additional variables:

  1. data: holds our original data matrix.
  2. layout: contains the projection coordinates we need for plotting the data.
3. knn: a weighted k-nearest neighbour graph that connects each observation to its nearest k neighbours. This generates the first topological representation of the data - like an initial sketch.

You may notice again that there is no data that suggests how we arrived at this solution. There are no eigenvectors or values to reverse the projection!

Let's extract the layout information, combine it with our sample information and project it using ggplot2.
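Much like the t-SNE plot, and assuming the same hypothetical sample_info.df metadata:

```r
library(ggplot2)

# layout holds the 2-D projection coordinates
umap.df <- data.frame(umap1  = tissue_umap$layout[, 1],
                      umap2  = tissue_umap$layout[, 2],
                      tissue = sample_info.df$tissue)  # hypothetical metadata

ggplot(umap.df, aes(x = umap1, y = umap2, colour = tissue)) +
  geom_point(size = 3) +
  theme_minimal()
```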


4.2.2 Interpreting our UMAP result

So it looks like we retrieved a fairly nice result without much tinkering. We can see now that liver and kidney are more closely grouped as tissues, while heart samples still generally cluster together. Bowel and jejunum appear spatially grouped, and our placenta samples are still close to each other. The clustering we saw with cases 1, 4, and 5 appears less pronounced.

There does appear to be some structure among the lung samples from different case numbers, so this might be an avenue to explore next to see if there truly is a relationship between these groups.


5.0.0 Class summary

Isthis_cluster.png
While the t-SNE and UMAP projections do produce clustered data, you have no route back to understanding their relationships. PCA, on the other hand, is strictly a dimension reduction tool. It does not place or assign data points to any groups, BUT it is useful to run on large datasets prior to clustering!

Today we took a deep dive into principal component analysis. There are of course different variants of this based on the assumptions you can make about your observations and variables like independent component analysis (ICA, non-Gaussian features) and multiple correspondence analysis (MCA, categorical features). Some additional methods can also be used to store the transformation like a PCA does, notably variational autoencoders (VAE).

Overall we should remember that while PCA can have problems in its feature extraction, it is deterministic and repeatable. The final results are also provided in such a way that new observations can be transformed and projected onto the same principal components. You can also feed these components back into clustering algorithms like k-means to try and identify specific subgroups.

t-SNE and UMAP, on the other hand, appear to do a much better job with high-dimensional data. They preserve local structure, and UMAP can also do a fairly good job of preserving global structure. These tools make for great exploratory analysis of your complex datasets. Interpretation of the relationships, however, is not mathematically clear as it is in PCA. These are, after all, projections from a higher dimension for our simpler primate brains!


5.1.0 Weekly assignment

This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown

Code (50%)
- Does it follow best practices?
- Does it make good use of available packages?
- Was the data prepared properly?

Answers and Output (50%)
- Is the output based on the correct dataset?
- Are the groupings appropriate?
- Are the titles/axes/legends correct?
- Is the interpretation of the graphs correct?

Since coding styles and solutions can differ, students are encouraged to use best practices. Well-coded or elegant solutions may be rewarded.

You can save and download the Jupyter notebook in its native format. Submit this file to the appropriate assignment section by 1:59 pm on the date of our next class: April 14th, 2022.


5.2.0 Acknowledgements

Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


5.3.0 References

More information on calculating optimal clusters: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/

Step-by-step how PCA works: https://builtin.com/data-science/step-step-explanation-principal-component-analysis

More PCA explanations here: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

The math of PCA: https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf

t-SNE in R and Python: https://datavizpyr.com/how-to-make-tsne-plot-in-r/

All about UMAP: https://umap-learn.readthedocs.io/en/latest/basic_usage.html

CAGEF_new.png